Topic Modeling over Short Texts by Incorporating Word Embeddings
نویسندگان
چکیده
Inferring topics from the overwhelming amount of short texts becomes a critical but challenging task for many content analysis tasks, such as content charactering, user interest profiling, and emerging topic detecting. Existing methods such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) cannot solve this problem very well since only very limited word co-occurrence information is available in short texts. This paper studies how to incorporate the external word correlation knowledge into short texts to improve the coherence of topic modeling. Based on recent results in word embeddings that learn semantically representations for words from a large corpus, we introduce a novel method, Embedding-based Topic Model (ETM), to learn latent topics from short texts. ETM not only solves the problem of very limited word co-occurrence information by aggregating short texts into long pseudotexts, but also utilizes a Markov Random Field regularized model that gives correlated words a better chance to be put into the same topic. The experiments on real-world datasets validate the effectiveness of our model comparing with the state-of-the-art models.
منابع مشابه
Semantic Visualization for Short Texts with Word Embeddings
Semantic visualization integrates topic modeling and visualization, such that every document is associated with a topic distribution as well as visualization coordinates on a low-dimensional Euclidean space. We address the problem of semantic visualization for short texts. Such documents are increasingly common, including tweets, search snippets, news headlines, or status updates. Due to their ...
متن کاملUnsupervised Topic Modeling for Short Texts Using Distributed Representations of Words
We present an unsupervised topic model for short texts that performs soft clustering over distributed representations of words. We model the low-dimensional semantic vector space represented by the dense distributed representations of words using Gaussian mixture models (GMMs) whose components capture the notion of latent topics. While conventional topic modeling schemes such as probabilistic l...
متن کاملIntegrating Topic Modeling with Word Embeddings by Mixtures of vMFs
Gaussian LDA integrates topic modeling with word embeddings by replacing discrete topic distribution over word types with multivariate Gaussian distribution on the embedding space. This can take semantic information of words into account. However, the Euclidean similarity used in Gaussian topics is not an optimal semantic measure for word embeddings. Acknowledgedly, the cosine similarity better...
متن کاملImproving Twitter Sentiment Classification Using Topic-Enriched Multi-Prototype Word Embeddings
It has been shown that learning distributed word representations is highly useful for Twitter sentiment classification. Most existing models rely on a single distributed representation for each word. This is problematic for sentiment classification because words are often polysemous and each word can contain different sentiment polarities under different topics. We address this issue by learnin...
متن کاملMixed Membership Word Embeddings for Computational Social Science
Word embeddings improve the performance of NLP systems by revealing the hidden structural relationships between words. These models have recently risen in popularity due to the performance of scalable algorithms trained in the big data setting. Despite their success, word embeddings have seen very little use in computational social science NLP tasks, presumably due to their reliance on big data...
متن کامل